NLP - Word Embedding

  • Created by Andrés Segura Tinoco
  • Created on June 09, 2019

Word Embedding: the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a sparse space with one dimension per word to a continuous vector space with a much lower dimension. Source
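To make the idea concrete, here is a toy sketch of such a mapping (the 4-dimensional vectors below are made-up illustrative values; a real model learns them from a corpus, typically with 100-300 dimensions):

import numpy as np

# Hypothetical embedding: each word maps to a dense vector of real numbers
embedding = {
    'holmes': np.array([ 0.21, -0.48, 0.10,  0.77]),
    'watson': np.array([ 0.19, -0.45, 0.14,  0.70]),  # close to 'holmes'
    'lamp':   np.array([-0.60,  0.32, 0.55, -0.11]),  # far from both
}

# Cosine similarity: words with related meanings should score close to 1.0
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(embedding['holmes'], embedding['watson']))  # high (~0.99)
print(cosine(embedding['holmes'], embedding['lamp']))    # low (negative)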

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community.

Example with a document in English

In [1]:
# Load Python libraries
import io
import re
import pandas as pd
import random
import numpy as np
from collections import Counter
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Load NLP libraries from gensim and spacy
from gensim.models import Word2Vec
import spacy.lang.en as en
In [3]:
# Load Plot libraries
from wordcloud import WordCloud
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Step 1 - Read natural text from a book

In [4]:
# Util function to read a plain text file
def read_text_file(file_path):
    with io.open(file_path, 'r', encoding = 'ISO-8859-1') as f:
        text = f.read()
    
    return text
In [5]:
# Get text sample
file_path = "../data/en/The Adventures of Sherlock Holmes - Arthur Conan Doyle.txt"
plain_text = read_text_file(file_path)
len(plain_text)
Out[5]:
576467
In [6]:
# Show first 1000 characters of document
plain_text[:1000]
Out[6]:
"\nProject Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n\nTitle: The Adventures of Sherlock Holmes\n\nAuthor: Arthur Conan Doyle\n\nRelease Date: November 29, 2002 [EBook #1661]\nLast Updated: May 20, 2019\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\n*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***\n\n\n\nProduced by an anonymous Project Gutenberg volunteer and Jose Menendez\n\n\n\ncover\n\n\n\nThe Adventures of Sherlock Holmes\n\n\n\nby Arthur Conan Doyle\n\n\n\nContents\n\n\n   I.     A Scandal in Bohemia\n   II.    The Red-Headed League\n   III.   A Case of Identity\n   IV.    The Boscombe Valley Mystery\n   V.     The Five Orange Pips\n   VI.    The Man with the Twisted Lip\n   VII.   The Adventure of the Blue C"

Step 2 - Tokenize and remove Stopwords

Data Quality process: refers to cleaning the input data so that it has meaning and value.

In [7]:
# Cleaning the text
clean_text = plain_text.lower()
clean_text = clean_text.replace('\n', '.')
clean_text = re.sub('[^a-zA-Z.]', ' ', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text)
clean_text = re.sub(r'\.+', ".", clean_text)
clean_text[:1000]
Out[7]:
'.project gutenberg s the adventures of sherlock holmes by arthur conan doyle.this ebook is for the use of anyone anywhere at no cost and with.almost no restrictions whatsoever. you may copy it give it away or.re use it under the terms of the project gutenberg license included.with this ebook or online at www.gutenberg.net.title the adventures of sherlock holmes.author arthur conan doyle.release date november ebook .last updated may .language english.character set encoding utf . start of this project gutenberg ebook the adventures of sherlock holmes .produced by an anonymous project gutenberg volunteer and jose menendez.cover.the adventures of sherlock holmes.by arthur conan doyle.contents. i. a scandal in bohemia. ii. the red headed league. iii. a case of identity. iv. the boscombe valley mystery. v. the five orange pips. vi. the man with the twisted lip. vii. the adventure of the blue carbuncle. viii. the adventure of the speckled band. ix. the adventure of the engineer s thumb. x. th'
In [8]:
# Tokenize text in sentences
sentence_list = clean_text.split('.')
len(sentence_list)
Out[8]:
14592
In [9]:
# Tokenize sentences in words
word_list = [sentence.split() for sentence in sentence_list if len(sentence.split()) > 0]
word_list[:10]
Out[9]:
[['project',
  'gutenberg',
  's',
  'the',
  'adventures',
  'of',
  'sherlock',
  'holmes',
  'by',
  'arthur',
  'conan',
  'doyle'],
 ['this',
  'ebook',
  'is',
  'for',
  'the',
  'use',
  'of',
  'anyone',
  'anywhere',
  'at',
  'no',
  'cost',
  'and',
  'with'],
 ['almost', 'no', 'restrictions', 'whatsoever'],
 ['you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or'],
 ['re',
  'use',
  'it',
  'under',
  'the',
  'terms',
  'of',
  'the',
  'project',
  'gutenberg',
  'license',
  'included'],
 ['with', 'this', 'ebook', 'or', 'online', 'at', 'www'],
 ['gutenberg'],
 ['net'],
 ['title', 'the', 'adventures', 'of', 'sherlock', 'holmes'],
 ['author', 'arthur', 'conan', 'doyle']]
In [10]:
# Count the words in a document and return the N most repeated
def count_words(sentences, n):
    words = Counter()
    
    for sent in sentences:
        for word in sent:
            words[word] += 1
    
    return words.most_common(n)
In [11]:
# Get the most common words in the document
n_words = count_words(word_list, 50)
df = pd.DataFrame.from_records(n_words, columns = ['word', 'quantity'])
df.head(10)
Out[11]:
word quantity
0 the 5636
1 i 3038
2 and 3020
3 to 2744
4 of 2661
5 a 2643
6 in 1766
7 that 1752
8 it 1737
9 you 1503
In [12]:
# Plot the most common words in the document
fig = plt.figure(figsize = (18, 6))
sns.barplot(x = 'word', y = 'quantity', data = df)
plt.title('The 50 Most Common Words in the Document')
plt.show()

- Stopwords: the most common words in a language, which do not significantly affect the meaning of the text.

In [13]:
# Get English stopwords
stopwords_en = en.stop_words.STOP_WORDS
print(stopwords_en)
{'forty', 'see', 'rather', 'whereas', 'this', 'bottom', 'be', 'herself', 'keep', 'anywhere', 'further', 'seem', 'somehow', 'ten', 'thence', 'when', '’d', 'but', 'have', 'towards', 'did', 'along', 'may', 'across', 'neither', 'is', 'empty', 'had', 'about', 'my', 'something', 'always', 'although', 'nobody', 'name', 'very', 'well', 'somewhere', 'everyone', "'m", 'themselves', 'his', 'do', 'if', 'whether', 'all', 'can', 'via', '‘ll', 'part', 'could', 'eight', 'some', '‘re', 'side', 'he', 'anyway', 'being', 'due', 'say', 'six', 'what', 'whole', 'really', 'within', 'again', 'first', 'mine', 'few', 'hundred', 'five', 'for', 'everything', 'however', 'now', 'not', 'elsewhere', 'just', 'our', 'must', 'there', 'using', 'whereby', 'many', '’ve', 'become', 'give', 'of', 'two', 'at', 'might', 'thru', 'why', 'one', 'and', 'that', '‘m', 'too', 'whither', 'everywhere', 'both', 'even', 'much', 'afterwards', 'a', 'fifty', 'else', 'anyone', 'no', 'latterly', 'nor', 'someone', 'wherein', 'regarding', 'formerly', 'doing', 'himself', 'sometime', 'their', 'also', 'throughout', 'top', 'other', 'various', 'anything', 'three', 'hereupon', 'show', '’ll', 'several', 'myself', 'per', 'sometimes', 'thus', 'down', 'whom', 'against', 'him', 'last', 'itself', 'therein', 'where', 'they', '’m', 'above', 'becoming', 'except', 'often', 'each', 'n’t', 'next', 'hereby', 'either', 'eleven', "'ve", 'yourself', '’re', 'beforehand', "'ll", 'only', 'out', 'then', 'was', 'through', 'twenty', 'i', 'seems', 'we', 'while', 'therefore', 'how', 'were', 'amount', 'will', 'front', 'another', 'under', 'which', 'who', 'am', 'whoever', 'any', 'nevertheless', 'make', 'an', 'same', 'does', 'done', 'thereafter', 'put', 'used', 'seeming', 'anyhow', 'because', 'becomes', 'me', "'d", 'latter', 'them', 'as', 'mostly', '‘d', 'nine', 'seemed', 'perhaps', 'thereby', 'twelve', 'by', 'hereafter', 'none', 'would', "n't", 'hence', 'noone', 'from', 'serious', 'former', 'amongst', 'third', 'here', 'whatever', 'became', 'has', 'over', 'until', 'your', '’s', 'almost', 'please', 'toward', 'herein', 'still', 'to', 'back', 'ours', 'meanwhile', 'whereafter', 'yet', 'hers', 'in', 'whose', 'most', 'she', 'yours', 'n‘t', 'these', 'below', 'are', 'four', 'it', 'quite', 'ca', 'so', 'namely', 'made', 'since', 'take', 'thereupon', 'unless', 'you', 'on', 'call', 'full', 'together', 'nowhere', 'beside', 'the', 'least', 'wherever', 'alone', 'among', 'less', 'once', 're', 'whenever', 'off', 'or', "'s", 'ourselves', 'yourselves', 'her', 'ever', '‘s', 'indeed', 'during', 'its', 'should', 'others', 'every', 'up', 'besides', 'whereupon', 'enough', 'otherwise', '‘ve', "'re", 'been', 'already', 'sixty', 'fifteen', 'between', 'though', 'into', 'more', 'moreover', 'never', 'upon', 'those', 'us', 'move', 'after', 'go', 'behind', 'beyond', 'own', 'such', 'get', 'nothing', 'with', 'without', 'than', 'cannot', 'whence', 'onto', 'around', 'before'}
In [14]:
# Remove stopwords
all_words = []
for ix in range(len(word_list)):
    all_words.append([word for word in word_list[ix] if (word not in stopwords_en and len(word) > 2)])

all_words[:10]
Out[14]:
[['project',
  'gutenberg',
  'adventures',
  'sherlock',
  'holmes',
  'arthur',
  'conan',
  'doyle'],
 ['ebook', 'use', 'cost'],
 ['restrictions', 'whatsoever'],
 ['copy', 'away'],
 ['use', 'terms', 'project', 'gutenberg', 'license', 'included'],
 ['ebook', 'online', 'www'],
 ['gutenberg'],
 ['net'],
 ['title', 'adventures', 'sherlock', 'holmes'],
 ['author', 'arthur', 'conan', 'doyle']]
In [15]:
# Get the most common words in the document after removing the stopwords
n_words = count_words(all_words, 50)
df = pd.DataFrame.from_records(n_words, columns = ['word', 'quantity'])
df.head(10)
Out[15]:
word quantity
0 said 486
1 holmes 465
2 man 305
3 little 269
4 think 174
5 room 171
6 know 170
7 shall 169
8 come 161
9 time 151
In [16]:
# Plot the most common words in the document after removing the stopwords
fig = plt.figure(figsize = (18, 6))
sns.barplot(x = 'word', y = 'quantity', data = df)
plt.title('The 50 Most Common Words in the Document (without Stopwords)')
plt.show()
In [17]:
# Reconstructing the clean text (without stopwords)
new_clean_text = ' '.join(word for sent in all_words for word in sent)
In [18]:
# Custom color function (HSL saturation and lightness must be valid percentages, 0-100)
def color_func(word, font_size, position, orientation, random_state = None, **kwargs):
    return "hsl(45, 80%%, %d%%)" % random.randint(40, 90)

# Create a Word cloud
wc = WordCloud(max_font_size = 60, min_font_size = 5, max_words = 150, background_color = "black", margin = 2)
wc = wc.generate(new_clean_text)

# Plot a Word cloud
plt.figure(figsize = (12, 12))
plt.imshow(wc.recolor(color_func = color_func, random_state=3), interpolation = "bilinear")
plt.axis("off")
plt.show()

Step 3 - Create a Word2Vec model

- Word2Vec consists of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer and one output layer. Word2Vec utilizes two architectures: CBOW (Continuous Bag of Words) and Skip-gram, sketched below. Source
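Both architectures are selected through the sg parameter in gensim. A minimal sketch of the two variants (assuming the gensim 3.x API used in this notebook and the all_words list built in Step 2):

# CBOW (sg = 0): predicts a target word from its surrounding context words
cbow_model = Word2Vec(all_words, sg = 0, min_count = 3, size = 150, window = 5)

# Skip-gram (sg = 1): predicts the surrounding context words from a target word
skipgram_model = Word2Vec(all_words, sg = 1, min_count = 3, size = 150, window = 5)

Skip-gram is typically slower to train but is often reported to represent infrequent words better; CBOW is faster, and it is the variant used below.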

In [19]:
# Algorithm params
min_count = 3    # Minimum frequency count of words; the model ignores words that occur less often
size = 150       # The size of the dense vector that represents each token or word
window = 5       # The maximum distance between the target word and its neighboring words
sg = 0           # Training algorithm: 0 for CBOW (Continuous Bag of Words), 1 for Skip-gram
n_iter = 10      # Number of iterations (epochs) over the corpus
In [20]:
# Create Word2Vec model
w2v_model = Word2Vec(all_words, min_count = min_count, size = size, window = window, sg = sg, iter = n_iter)

- Vocabulary: the set of unique words in the document.

In [21]:
# Show vocabulary size: unique words occurring at least min_count (3) times
vocabulary = w2v_model.wv.vocab
len(vocabulary)
Out[21]:
2841
In [22]:
# Show the 150-dimensional vector for 'holmes'
w2v_model.wv['holmes']
Out[22]:
array([-2.55309530e-02, -3.80948305e-01, -5.94039083e-01, -2.29917169e-02,
        4.49464843e-02, -8.65593627e-02,  4.34779108e-01,  2.31260568e-01,
        2.26458758e-01, -2.08258107e-01,  1.72266632e-01, -1.29082978e-01,
        2.44869292e-01, -1.17141545e-01,  5.90690851e-01, -2.96591550e-01,
        2.24449947e-01, -1.58245921e-01,  1.81890696e-01, -2.02782795e-01,
       -5.25846817e-02,  3.63692641e-01, -1.90802664e-01,  1.88176900e-01,
       -3.66924942e-01, -1.27192616e-01,  1.91675514e-01, -3.87567252e-01,
       -3.63166749e-01,  3.19596350e-01, -2.12143674e-01,  5.80339320e-02,
       -3.64057496e-02, -1.06716089e-01,  5.41476190e-01,  3.03713679e-01,
        5.44609539e-02,  8.68673772e-02,  2.67600417e-01,  4.15813886e-02,
       -2.61329353e-01,  5.73478453e-02,  1.61343083e-01, -6.87725917e-02,
        3.33145618e-01,  2.72925168e-01, -3.30122769e-01, -1.35466665e-01,
        2.48068914e-01, -1.97158441e-01, -1.79745018e-01,  5.46169281e-01,
        9.66735259e-02,  1.22561000e-01,  2.60456234e-01, -1.13705464e-01,
       -6.32253885e-02,  1.97710529e-01,  1.53591707e-01,  2.37814516e-01,
       -1.38751134e-01,  3.86664003e-01, -1.72816023e-01, -5.67218401e-02,
        2.18985602e-01, -5.73132992e-01,  2.78956413e-01, -2.54947692e-01,
        1.20569445e-01, -1.22889489e-01,  4.59067762e-01, -2.85603464e-01,
        4.83051054e-02,  2.68255383e-01,  4.02651101e-01, -4.61243205e-02,
        1.82139084e-01, -4.16781157e-01, -2.01869741e-01, -7.90120900e-01,
        2.51291782e-01,  1.65222704e-01, -5.31209968e-02,  1.48148179e-01,
        1.09584242e-01,  3.72660816e-01,  3.57212365e-01,  2.07335949e-02,
       -4.02208239e-01,  8.86686593e-02,  3.36165726e-01,  3.12308729e-01,
        3.77913475e-01, -3.17593843e-01,  1.94186836e-01, -3.86472106e-01,
        7.97571167e-02,  1.34194985e-01, -1.77155539e-01, -1.02397770e-01,
       -1.06433243e-01,  8.15659016e-03,  2.97181159e-01,  2.13090658e-01,
        3.75765413e-01, -3.32017958e-01,  2.76264876e-01, -5.15093267e-01,
       -4.14226353e-02, -1.62738264e-01, -1.75253764e-01, -9.22462717e-02,
        6.85715154e-02,  1.81130618e-01, -1.20653450e-01,  1.15494609e-01,
       -8.07769597e-02,  1.33773284e-02, -1.25904173e-01, -2.64966227e-02,
       -1.43337194e-02, -2.23344207e-01,  1.83504462e-01, -3.22326660e-01,
        9.73469950e-03, -3.32587987e-01,  9.15770698e-03, -2.16075256e-01,
       -4.49939519e-01,  7.05705211e-02, -2.86682904e-01,  1.87116802e-01,
        4.51862663e-02,  3.03558081e-01, -1.90053627e-01, -1.83379501e-01,
       -6.38921978e-04,  4.76206630e-01, -6.36085808e-01, -4.58958089e-01,
        6.24753833e-01, -2.55594194e-01, -5.00372767e-01, -9.74199101e-02,
        3.28555405e-01, -3.44924539e-01,  4.76771779e-02, -2.28526443e-01,
       -1.49872839e-01, -1.11064635e-01], dtype=float32)

- Similar Words: the words closest to a given word in terms of meaning and context.

In [23]:
# Finding the words most similar to 'holmes'
w2v_model.wv.most_similar(positive = ['holmes'], topn = 10)
Out[23]:
[('friend', 0.9997585415840149),
 ('away', 0.9997538924217224),
 ('took', 0.9997522830963135),
 ('case', 0.9997429847717285),
 ('hand', 0.999742329120636),
 ('asked', 0.9997420310974121),
 ('morning', 0.9997410178184509),
 ('like', 0.9997373819351196),
 ('came', 0.9997363090515137),
 ('left', 0.9997360706329346)]
In [24]:
# Finding the words least similar to 'holmes' (via the negative parameter)
w2v_model.wv.most_similar(negative = ['holmes'], topn = 10)
Out[24]:
[('awkward', 0.7395398020744324),
 ('cruelly', 0.3742988705635071),
 ('iii', 0.08360050618648529),
 ('governess', -0.015460893511772156),
 ('faithfully', -0.23219306766986847),
 ('stolen', -0.6758560538291931),
 ('punishment', -0.6900795102119446),
 ('choose', -0.7777112722396851),
 ('instructive', -0.7786570191383362),
 ('season', -0.8168827295303345)]
In [25]:
# Calculate the similarity between 2 words
w2v_model.wv.similarity(w1 = 'holmes', w2 = 'watson')
Out[25]:
0.9996760599543573
In [26]:
# Calculate similarity: sim(w1, w2) = sim(w2, w1)
w2v_model.wv.similarity(w1 = 'watson', w2 = 'holmes')
Out[26]:
0.9996760599543573
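Under the hood, wv.similarity computes the cosine similarity between the two word vectors, which is symmetric by definition. A quick manual check (reusing the w2v_model and np already loaded above):

v1 = w2v_model.wv['holmes']
v2 = w2v_model.wv['watson']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # same value as above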
In [27]:
# Show the word that doesn't belong in the list
w2v_model.wv.doesnt_match(['holmes', 'watson', 'mycroft'])
Out[27]:
'watson'
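Roughly speaking, doesnt_match length-normalizes the given word vectors, averages them, and returns the word least similar to that mean. A simplified sketch of the idea (an approximation, not gensim's exact implementation):

words = ['holmes', 'watson', 'mycroft']
vecs = np.array([w2v_model.wv[w] / np.linalg.norm(w2v_model.wv[w]) for w in words])
mean = vecs.mean(axis = 0)
sims = vecs.dot(mean / np.linalg.norm(mean))
print(words[int(np.argmin(sims))])  # expected to match the result above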

Step 4 - Plot similar words

In [28]:
# Get vectors
target_word = 'sherlock'
top_n = 25

# Calculate the most and least similar words
most_similar = w2v_model.wv.most_similar(positive = [target_word], topn = top_n)
less_similar = w2v_model.wv.most_similar(negative = [target_word], topn = top_n)

# Save them
neighbors = [(target_word, 1, 'current')]
neighbors += [(*row, 'most') for row in most_similar]
neighbors += [(*row, 'less') for row in less_similar]

# Get the neighbors' vectors
neigh_word = [row[0] for row in neighbors]
X = w2v_model.wv[neigh_word]
len(X)
Out[28]:
51
In [29]:
# Perform PCA with 2 components
pca = PCA(n_components = 2)
pca_data = pca.fit_transform(X)

# The explained variance ratio of each principal component
list(pca.explained_variance_ratio_)
Out[29]:
[0.9995707, 5.5996243e-05]
In [30]:
# Create and show principal components DataFrame
pca_df = pd.DataFrame(data = pca_data, columns = ["PC1", "PC2"])
pca_df['Name'] = neigh_word
pca_df.head(10)
Out[30]:
PC1 PC2 Name
0 0.246548 0.063437 sherlock
1 1.868599 0.023652 said
2 0.817528 0.008513 asked
3 0.525512 0.004849 remarked
4 0.387977 0.007799 watson
5 1.256235 -0.000606 chair
6 1.624056 -0.001593 hand
7 0.683683 0.000615 sat
8 2.660357 -0.006268 little
9 1.405960 -0.002510 head
In [31]:
# Create a scatter plot of the projection
fig, ax = plt.subplots(figsize = (16, 16))
gap = 0.001
colors = dict()
colors['current'] = 'royalblue'
colors['most'] = 'forestgreen'
colors['less'] = 'orange'

# Add points one by one with a loop
for i, word in enumerate(neigh_word):
    node_col = colors[neighbors[i][2]]
    
    if word == target_word:
        node_size = 100
        text = word.upper()
    else:
        node_size = 50
        text = word + ': ' + str(round(neighbors[i][1], 3))
        
    plt.scatter(pca_data[i, 0], pca_data[i, 1], c = node_col, s = node_size)
    plt.annotate(text, xy = (pca_data[i, 0] + gap*30, pca_data[i, 1] - gap/3))

# Plot setup
ax.set_xlabel("PC 1", fontsize = 12)
ax.set_ylabel("PC 2", fontsize = 12)
ax.set_title("Most Similar Words to " + target_word, fontsize = 20)
ax.legend(["Similar Words"])
ax.grid()

Step 5 - Export the similarity between words

In [32]:
# Create a more compact Word2Vec model (higher min_count, smaller vectors) to shrink the vocabulary for export
w2v_model = Word2Vec(all_words, min_count = 5, size = 100, window = 5, sg = 0)
vocabulary = w2v_model.wv.vocab
len(vocabulary)
Out[32]:
1780

Dense matrix

In [33]:
# Returns the dense similarity between all the words in the document
def get_dense_similarity(model, precision = 3):
    words_sim = []
    vocabulary = model.wv.vocab
    
    for w1 in vocabulary:
        row_sim = []
        for w2 in vocabulary:
            word_sim = model.wv.similarity(w1 = w1, w2 = w2)
            row_sim.append(round(word_sim, precision))
        words_sim.append(row_sim)
    
    return words_sim
In [34]:
# Create dataframe with the similarity between all the words in the document
words_sim = get_dense_similarity(w2v_model, 2)
df_dense = pd.DataFrame.from_records(words_sim, columns = vocabulary)
print(df_dense.shape)
df_dense.iloc[:18, :18]
(1780, 1780)
Out[34]:
gutenberg adventures sherlock holmes arthur use away date english character set start cover contents scandal bohemia red headed
0 1.00 0.01 0.19 0.19 0.28 0.19 0.18 0.15 0.05 0.05 0.19 0.08 -0.11 0.20 0.16 0.21 0.18 0.19
1 0.01 1.00 0.54 0.54 0.46 0.52 0.54 0.19 0.25 0.51 0.54 0.53 0.23 0.20 0.42 0.34 0.51 0.53
2 0.19 0.54 1.00 0.97 0.86 0.92 0.97 0.39 0.63 0.82 0.95 0.86 0.49 0.60 0.71 0.72 0.96 0.89
3 0.19 0.54 0.97 1.00 0.86 0.94 0.99 0.44 0.65 0.84 0.97 0.89 0.47 0.59 0.73 0.77 0.98 0.92
4 0.28 0.46 0.86 0.86 1.00 0.79 0.86 0.32 0.55 0.73 0.85 0.75 0.38 0.52 0.66 0.71 0.85 0.81
5 0.19 0.52 0.92 0.94 0.79 1.00 0.94 0.35 0.57 0.77 0.93 0.83 0.44 0.57 0.66 0.68 0.93 0.89
6 0.18 0.54 0.97 0.99 0.86 0.94 1.00 0.45 0.65 0.84 0.98 0.87 0.46 0.58 0.71 0.75 0.98 0.92
7 0.15 0.19 0.39 0.44 0.32 0.35 0.45 1.00 0.45 0.44 0.42 0.42 0.14 0.21 0.21 0.35 0.43 0.36
8 0.05 0.25 0.63 0.65 0.55 0.57 0.65 0.45 1.00 0.63 0.65 0.64 0.44 0.39 0.41 0.46 0.65 0.56
9 0.05 0.51 0.82 0.84 0.73 0.77 0.84 0.44 0.63 1.00 0.83 0.75 0.44 0.49 0.65 0.57 0.83 0.79
10 0.19 0.54 0.95 0.97 0.85 0.93 0.98 0.42 0.65 0.83 1.00 0.87 0.48 0.57 0.70 0.72 0.97 0.92
11 0.08 0.53 0.86 0.89 0.75 0.83 0.87 0.42 0.64 0.75 0.87 1.00 0.48 0.51 0.62 0.66 0.86 0.82
12 -0.11 0.23 0.49 0.47 0.38 0.44 0.46 0.14 0.44 0.44 0.48 0.48 1.00 0.43 0.38 0.48 0.46 0.46
13 0.20 0.20 0.60 0.59 0.52 0.57 0.58 0.21 0.39 0.49 0.57 0.51 0.43 1.00 0.57 0.45 0.56 0.56
14 0.16 0.42 0.71 0.73 0.66 0.66 0.71 0.21 0.41 0.65 0.70 0.62 0.38 0.57 1.00 0.58 0.73 0.69
15 0.21 0.34 0.72 0.77 0.71 0.68 0.75 0.35 0.46 0.57 0.72 0.66 0.48 0.45 0.58 1.00 0.73 0.72
16 0.18 0.51 0.96 0.98 0.85 0.93 0.98 0.43 0.65 0.83 0.97 0.86 0.46 0.56 0.73 0.73 1.00 0.92
17 0.19 0.53 0.89 0.92 0.81 0.89 0.92 0.36 0.56 0.79 0.92 0.82 0.46 0.56 0.69 0.72 0.92 1.00
In [35]:
# Plot dense similarity matrix
fig, ax = plt.subplots(figsize = (14, 14))
sns.heatmap(words_sim, ax = ax)
ax.set_title("Dense Similarity Matrix", fontsize = 16)
plt.show()
In [36]:
# Exporting dense word similarity matrix
file_path = "../data/network/dense_similarity.csv"
df_dense.to_csv(file_path, index = False, sep = ',')

Sparse matrix

In [37]:
# Returns the sparse similarity between all the words in the document
def get_sparse_similarity(model, precision = 3, top_n = 10):
    matrix = []
    vocabulary = list(model.wv.vocab.keys())
    n_words = len(vocabulary)
    
    # Calculate sparse similarity: only the top_n neighbors of each word keep their score
    for word in vocabulary:
        row_sim = np.zeros(n_words)
        best_sim = model.wv.most_similar(positive = [word], topn = top_n)
        
        for neighbor in best_sim:
            nei_name = neighbor[0]
            nei_ix = vocabulary.index(nei_name)
            nei_sim = round(neighbor[1], precision)
            row_sim[nei_ix] = nei_sim
        
        matrix.append(row_sim)
    
    return matrix, vocabulary
In [38]:
# Create a data frame with the similarity between the nearest words
words_sim, vocabulary = get_sparse_similarity(w2v_model, 2, 10)
df_sparse = pd.DataFrame.from_records(words_sim, columns = vocabulary)
print(df_sparse.shape)
df_sparse.iloc[:18, :18]
(1780, 1780)
Out[38]:
gutenberg adventures sherlock holmes arthur use away date english character set start cover contents scandal bohemia red headed
0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.00 0.0 0.0 0.99 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.99 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 0.0 0.0 0.0 0.00 0.0 0.0 0.98 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 0.0 0.0 0.0 0.89 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 0.0 0.0 0.0 0.77 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
16 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
17 0.0 0.0 0.0 0.00 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [39]:
# Plot sparse similarity matrix
fig, ax = plt.subplots(figsize = (14, 14))
sns.heatmap(words_sim, ax = ax)
ax.set_title("Sparse Similarity Matrix", fontsize = 16)
plt.show()
In [40]:
# Exporting sparse word similarity matrix
file_path = "../data/network/sparse_similarity.csv"
df_sparse.to_csv(file_path, index = False, sep = ',')